feat: vf v1 <> nano bridge #2742

Draft
mikasenghaas wants to merge 101 commits into main from feat/nano-as-v1

mikasenghaas commented Jun 9, 2026

Companion PR to PrimeIntellect-ai/verifiers#1576 for verifiers v1 training integration.

mikasenghaas and others added 30 commits June 8, 2026 17:05
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Points the submodule at the vf-nano EnvServer branch so the orchestrator can
build on the env-server abstraction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano
EnvServer per env (it never loads an environment), dispatches rollouts by task
index, and trains on the returned Trace dicts (branches + renderer tokens).

- pyproject: dep verifiers -> vf-nano; drop v1/research env packages; keep only the
  vf-nano reverse-text example; override out the transitive v1 verifiers (pulled in
  by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add
  orjson/pandas/msgspec (previously transitive via verifiers).
- EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns).
- envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring,
  dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict.
- trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output.
- train_source: index sampling; client pool builds vf-nano ClientConfig; lag
  monitor vendored; env-server entrypoint repointed; ~14 files retyped off
  vf.RolloutOutput / vf.ClientConfig.
- configs/debug/vf_nano_reverse_text.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er config)

- trace_to_samples stitches each Trace branch's tokens into one TrainingSample
  (prompt = branch start, then each turn's new context [masked] + generated
  tokens [trained]); drop the RolloutOutput adapter — read the Trace's native
  fields directly (reward, error{type,message}, timing generation/scoring,
  num_turns, branches).
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics /
  orchestrator read native Trace fields (no token_usage/completion/timing.total).
- client pool forwards the shared renderers.RendererConfig to the env server's
  renderer client (so it uses qwen3, not the tool-less default fallback).
- debug config: tool_call_parser=hermes (vLLM accepts the agent's tools),
  max_steps=20.
- bump deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
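
The stitching described in the first bullet above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the prime-rl `trace_to_samples` code: `Turn`, `stitch_branch`, and the field names are hypothetical stand-ins for the vf-nano Trace types.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical stand-in for one turn of a Trace branch."""

    context_ids: list[int]    # new context tokens added before this turn's generation
    generated_ids: list[int]  # tokens sampled by the model in this turn


def stitch_branch(prompt_ids: list[int], turns: list[Turn]) -> tuple[list[int], list[bool]]:
    """Concatenate a branch into one token sequence plus a loss mask.

    Prompt and per-turn context tokens are masked out (False);
    only generated tokens are trained on (True).
    """
    input_ids = list(prompt_ids)
    loss_mask = [False] * len(prompt_ids)
    for turn in turns:
        input_ids += turn.context_ids
        loss_mask += [False] * len(turn.context_ids)
        input_ids += turn.generated_ids
        loss_mask += [True] * len(turn.generated_ids)
    return input_ids, loss_mask


ids, mask = stitch_branch([1, 2], [Turn([], [3, 4]), Turn([5], [6])])
```

Here the two prompt tokens and the single context token of the second turn are masked, and the three generated tokens are the only trained positions.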
…o timeout)

- Env.run_rollout/run_group pass the vf-nano ClientConfig object and a
  SamplingConfig (built from the env's sampling args) directly — no model_dump,
  no per-rollout timeout forwarded to the server.
- debug config: max_steps=20.
- bump deps/vf-nano (typed env-server RPC).
The env server returns a Trace minus its derived fields; the orchestrator resolves
the env's Task subclass (from config.id) and validates the wire dict into a strict
Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace —
typed task fields included (e.g. task.answer), nothing subscriptable.

- envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask].
- trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils
  /orchestrator: attribute access on the typed Trace (reward, error{type,message},
  branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the
  consumer.
- Task/Trace/TimeSpan stay strict (StrictBaseModel) — no extra=ignore anywhere.
- bump deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator spawns the env server, so request the serve extra
(zmq/msgpack) explicitly now that vf-nano keeps them out of core.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`from __future__ import annotations` already defers all annotations to strings,
so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout`
annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which
suggested an unparsed dict). Renames the field + every `.raw` access, the
`emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the
dispatcher cancel-path locals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop the FinishedRollout proxy properties (error/reward/is_truncated and the
  example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...}
  directly. The trace is the single source of truth.
- Use vf.Trace.has_error for existence checks instead of `.error is not None`.
- Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len,
  total_tokens,has_response} (now on the trace); keep trace_to_samples.
- Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx),
  source dict key) instead of the example/example_id dict carrier; identity comes
  off trace.task.idx.
- Mark the local-package env arrangement as a temporary/experimental TODO.
- Move the debug config to configs/debug/nano/reverse_text.toml.
- Bump deps/vf-nano (Trace/Turn accessors).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The env server binds tcp://127.0.0.1:0 and reports its concrete address back
  over a queue; the orchestrator connects to that. Removes _get_free_port and its
  TOCTOU race (the OS assigns the port atomically).
- A spawned server has already bound + loaded by the time it reports its address,
  so the untimed info() is enough — only poll wait_for_server_startup for an
  external (config.address) server, which has no spawn handshake.
- Bump deps/vf-nano (port report + Trace/Branch token-length accessors).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
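
The bind-then-report handshake in the commit above can be illustrated with the standard library alone. This is a sketch of the idea, not the vf-nano code: it assumes plain TCP sockets and a thread with `queue.Queue` in place of the ZMQ socket, child process, and mp queue the commit describes, and `serve` is a hypothetical name.

```python
import queue
import socket
import threading


def serve(address_queue: queue.Queue) -> None:
    """Bind to port 0, report the OS-assigned address, accept one client."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port atomically
    sock.listen()
    address_queue.put(sock.getsockname())  # report only after the bind succeeded
    conn, _ = sock.accept()
    conn.close()
    sock.close()


addresses: queue.Queue = queue.Queue()
thread = threading.Thread(target=serve, args=(addresses,), daemon=True)
thread.start()

# The consumer blocks until the server has bound, so no startup polling is needed.
host, port = addresses.get(timeout=5)
with socket.create_connection((host, port), timeout=5):
    pass
thread.join(timeout=5)
```

Because the socket that reserves the port is the same one that listens on it, there is no window in which another process can take the port between "find a free port" and "bind it".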
The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the
prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump
deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SFT trains on a teacher served over the chat client, which returns no token ids,
so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore
backfill: for each tokenless turn, render its prompt + assistant response with the
student chat template and split on the longest common prefix to fill TurnTokens
(masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills
when any turn lacks tokens, before building samples.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
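
The longest-common-prefix split used for backfill can be sketched as below. It assumes both renderings are already tokenized into integer ids; `split_on_common_prefix` is a hypothetical helper name, and the real code would obtain the ids from the student chat template.

```python
def split_on_common_prefix(
    prompt_ids: list[int], full_ids: list[int]
) -> tuple[list[int], list[int]]:
    """Split `full_ids` (prompt + response rendering) at the longest prefix
    it shares with `prompt_ids` (prompt-only rendering)."""
    n = 0
    for a, b in zip(prompt_ids, full_ids):
        if a != b:
            break
        n += 1
    return full_ids[:n], full_ids[n:]


# The prompt-only rendering is a strict prefix of the full rendering.
prefix, completion = split_on_common_prefix([1, 2, 3], [1, 2, 3, 4, 5])

# The renderings diverge early (e.g. a template token differs): the split
# falls back to the point of divergence.
prefix2, completion2 = split_on_common_prefix([1, 2, 9], [1, 2, 3, 4])
```

Splitting on the shared prefix rather than on `len(prompt_ids)` keeps the result well-defined even when the template renders the prompt slightly differently once a response follows it.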
drop_group's error_rollout_output calls omitted the required task_idx, so an
off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx
(or -1 when the group is already gone), mirroring handle_completed_rollout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task
  subclass via self.trace_type.model_validate(wire.to_wire()).
- dispatcher.py: drop the error_rollout_output helper — inline the synthetic error
  Trace at each call site using vf.Error's field names (type/message/traceback); the
  task-exception path carries a real traceback, cancels/empty-trajectory carry none.
- Bump deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nical

- Spawned env servers now route their output (logging + subprocess-runtime output)
  to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects
  stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned
  server logged nowhere.
- Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128
  (interval=1), matching configs/debug/training_modes/rl.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train,
.../logs/envs/eval), so _spawn must drop the file directly under it
(<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the
train/eval split under logs/envs/<kind>/envs/<name>.log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Instead of the orchestrator sidecar-spawning each env server as an mp child, the
rl launcher now spawns one `env-server` process per env (train + eval), each on a
free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same
model as inference/trainer. It sets env.address in the orchestrator config so the
orchestrator attaches (its existing external path) instead of spawning. Envs that
already set address (user-managed external server) are left alone; the orchestrator's
mp sidecar stays as the fallback for running `orchestrator` directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds
base_port + i. Drops the get_free_port dependency in the launcher.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i
(stride 1000), so each kind has headroom for many envs without the blocks colliding
(previously a single running index, so train and eval ports sat adjacent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
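
The port layout from the two commits above works out as in this small sketch. `ENV_SERVER_KIND_STRIDE`, the stride of 1000, and the default base port of 5000 come from the commit messages; the function name and the range check are assumptions added for illustration.

```python
ENV_SERVER_KIND_STRIDE = 1000  # value stated in the commit message


def env_server_port(base_port: int, kind: str, index: int) -> int:
    """Port for the index-th launcher-managed env of the given kind."""
    offsets = {"train": 0, "eval": ENV_SERVER_KIND_STRIDE}
    # Assumption for this sketch: reject indices that would spill into
    # the next kind's block.
    if not 0 <= index < ENV_SERVER_KIND_STRIDE:
        raise ValueError(f"index {index} does not fit in a block of {ENV_SERVER_KIND_STRIDE}")
    return base_port + offsets[kind] + index


train_ports = [env_server_port(5000, "train", i) for i in range(3)]
eval_ports = [env_server_port(5000, "eval", i) for i in range(2)]
```

With the default base port, train envs occupy 5000, 5001, 5002, and so on, while eval envs start at 6000.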
- env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs
  (EnvServer up, request failures) land in logs/envs/<kind>/<name>.log — previously
  only loguru output was captured, swallowing them.
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker
  leaked-semaphore warning on the sidecar path).
- configs/debug/nano/reverse_text.toml: drop the eval block, mirroring
  examples/reverse_text/rl.toml (train-only smoke; eval path validated separately).
- bump deps/vf-nano (serve/types docstring trim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…irectly

The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed
vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata
merge — the on-disk rollout is just the trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the
integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig);
env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint
passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump
deps/vf-nano to the renamed branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mikasenghaas and others added 20 commits June 11, 2026 07:11
Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from
deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through
the v1 loader, matching the other -v1 tasksets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the
  eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env
  at the r2e-gym-v1 taskset, and drop the eval block

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Env servers spawn their worker pool as fresh `spawn` processes with no logging
handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed
warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers'
`serve_env` as `log_setup`; it runs in the broker and in every worker. A worker
inherits the broker's redirected stdout/stderr, so its logs land in the same
`envs/{train,eval}/<name>.log` as before — no new files or paths.

Bumps deps/verifiers to the worker-logging fix.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich
dashboard's token counts fall back to provider usage when the endpoint returns
no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626
variant; 955b6cdf already contains the equivalent #1626 (env-server worker
logging) plus #1627.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no
longer print a spurious KeyboardInterrupt traceback into the env logs on
shutdown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the verifiers floor bump so the renderers offset-tokenizer fix (dev40,
PRs #72/#75) can't be undercut by a pre-fix PyPI resolution. Re-locks uv.lock to
the dev40 specifier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2774)

* feat(v1): elastic env-server pool (inherit pool config from verifiers)

Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so
each env inherits the `pool` discriminated union (static{num_workers=4} |
elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env
servers scale workers on demand instead of pre-spawning a fixed `auto` count.

- Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution
  (ceil(max_inflight/256)); the elastic pool self-sizes from load.
- envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
- Bump deps/verifiers to the elastic-pool branch.

Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic",
multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug
configs are migrated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): back-compat shim mapping legacy num_workers -> pool

EnvConfig forbids extra fields, so configs still setting the removed `num_workers`
would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`:
an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit
`pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)

The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was
redundant — just remove the legacy `num_workers` and inherit the default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
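
The `num_workers` back-compat mapping described in the squashed commit above amounts to a small dict transformation. The sketch below shows it as a plain function over the raw config dict; in the PR it is wired in through a pydantic `model_validator(mode="before")`, which is omitted here. The function name is hypothetical.

```python
from typing import Any


def map_legacy_num_workers(data: dict[str, Any]) -> dict[str, Any]:
    """Map the removed `num_workers` field onto `pool` before validation."""
    data = dict(data)  # do not mutate the caller's dict
    legacy = data.pop("num_workers", None)
    if legacy is None or "pool" in data:
        return data  # nothing to map, or an explicit `pool` wins
    if legacy == "auto":
        data["pool"] = {"type": "elastic"}  # the default elastic pool
    else:
        data["pool"] = {"type": "static", "num_workers": int(legacy)}
    return data


fixed = map_legacy_num_workers({"num_workers": 8})
auto = map_legacy_num_workers({"num_workers": "auto"})
explicit = map_legacy_num_workers(
    {"num_workers": 8, "pool": {"type": "elastic", "multiplex": 64}}
)
```

Dropping the legacy key in every branch matters because the config forbids extra fields: leaving `num_workers` in place would still fail validation.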
Realign the pin onto origin/feat/nano-as-v1: the prior pin d0c5bc98 was the
unsquashed #1629 feature branch, now squash-merged as f404e97f
(content-identical). Picks up #1629 (static/elastic env-server pool config).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1631 (per-rollout setup timing as a distinct phase) and #1632
(per-call model + runtime retries).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_wire validation)

Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra
inputs are not permitted' that crashed every rollout (#1636 drops computed
durations from to_wire).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1638 (add --resume for evals: re-run a previous run's
missing/errored rollouts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…[WireTask]) (#2781)

* chore(v1): stop importing env modules in the orchestrator

The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and
vf.task_type imports the env package just to read its Task subclass for typing the wire trace.
Nothing reads typed env task fields - only task.idx and a full task.model_dump - and WireTask
(extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the
orchestrator never imports an env package: the env's type and runtime both live only in the
server process.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE

It no longer varies per env, so it doesn't belong as a per-instance attribute set in
Env.__init__ - lift it to a module constant used directly in run_rollout/run_group.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mikasenghaas and others added 9 commits June 12, 2026 14:20
* fix(v1): cap hendrycks-sanity scoring at 10s

Without a scoring timeout (the default is no limit), a wedged math verify holds its
rollout's permit forever — sympy can spin past the in-script alarm — and at 512
concurrency that starves the pool and stalls long runs. Set timeout.scoring = 10 on the
train and eval envs so the framework cancels and the subprocess runtime kills a runaway
verify, freeing the permit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: drop inline comment on the scoring timeout

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
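
Why a scoring timeout frees the permit can be seen in a small asyncio sketch. It assumes the permit is an `asyncio.Semaphore` and that the verify step is a cooperative coroutine; `score_with_timeout` and `wedged_verify` are hypothetical names. It does not model the subprocess kill that the commit relies on for a CPU-bound sympy call, which task cancellation alone cannot interrupt.

```python
import asyncio


async def score_with_timeout(permits: asyncio.Semaphore, verify, timeout: float):
    """Run `verify` under a permit; give up and release the permit on timeout."""
    async with permits:
        try:
            return await asyncio.wait_for(verify(), timeout=timeout)
        except asyncio.TimeoutError:
            return None  # treated as a failed scoring in this sketch


async def wedged_verify() -> float:
    await asyncio.sleep(3600)  # stands in for a verify that never returns
    return 1.0


async def main() -> tuple:
    permits = asyncio.Semaphore(1)
    result = await score_with_timeout(permits, wedged_verify, timeout=0.05)
    return result, permits.locked()


result, still_held = asyncio.run(main())
```

Without the `wait_for` bound, the single permit would stay held for the full sleep and every later rollout waiting on the semaphore would stall behind it.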
…mports (#2792)

Bump deps/verifiers to feat/nano-as-v1 HEAD (8873a740), which includes verifiers#1654 — the v1
interception rework: role-named clients (EvalClient/TrainClient), route-detected wire dialects
(chat/responses/anthropic), 1:1 relay + streaming, reasoning preserved.

Adopt the renamed client config classes in prime_rl/utils/client.py:
OpenAIClientConfig -> EvalClientConfig, RendererClientConfig -> TrainClientConfig.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts, #1660)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It was only a manual editable install, so `uv sync` pruned it. Add it to the env dependency
group + [tool.uv.sources] (mirroring r2e-gym-v1) so it persists across syncs and is available
out of the box.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
verifiers#1653 (carry mm tensors across the env-server wire) is merged and pinned, so
`MessageNode.multi_modal_data` is no longer `exclude=True` — `model_dump(mode="json")` now
serializes the base64 pixel tensors into `train_rollouts.jsonl` and the wandb sample tables,
bloating every line. They're the training `mm_kwargs` carrier, not part of the rollout
record, so exclude them at the dump boundary (train + eval paths).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Declare the remaining 7 verifiers v1 example tasksets (code-golf, deepwiki,
glossary, swelego, wiki-search, wikispeedia, wordle) as editable deps so uv sync
installs every example, matching the verifiers examples set. chromadb/textarena
were already present via the v0 wiki-search/wordle envs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The example harness (examples/harnesses/compact) was missing from prime-rl deps,
so the documented --harness.id compact branching example failed to resolve
(ModuleNotFoundError: harness compact not found). Declare it like the example
tasksets so uv sync installs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>